Fix use-after-free races in memory pool shrinker and DRM fence destruction #1004
neoyubi wants to merge 1 commit into NVIDIA:main
Conversation
During memory pressure, kswapd invokes shrinker callbacks via `shrink_slab()`. A race condition exists where `nv_mem_pool_destroy()` can free the shrinker while kswapd is still iterating, causing the kernel to call corrupted function pointers and crash.

Changes:
- Move `nv_mem_pool_shrinker_free()` to execute first in the destroy sequence
- Add `synchronize_rcu()` after shrinker unregistration to ensure all RCU readers have completed before continuing destruction
- Set the shrinker pointer to NULL after free to prevent dangling references
- Split DRM fence context destruction into prepare + final phases to signal fences before `drm_gem_object_release()`

Tested on RTX 5090 with kernel 6.18.5; system stable after the fix.
Reading through this code, I believe the refactors to
Thanks for taking the time to look through this, appreciate it. To be transparent about the root cause: after more investigation, the crashes turned out to be triggered by a hardware misconfiguration on my end. The RTX 5090 was running in a Gen4 configuration with PCIe bandwidth limited to x4 instead of x16 in the BIOS. That bottleneck was what actually pushed things into the failure state. To answer your questions directly:
The fixes themselves are technically sound; the ordering issues in shrinker teardown and fence context destruction are real. But I understand they may not be worth the integration risk, given they address an edge case that required a specific hardware misconfiguration to hit. Totally fine with closing this if that's the call.
Fix use-after-free races in memory pool shrinker and DRM fence destruction
Summary
This patch fixes two related use-after-free race conditions that cause kernel crashes under memory pressure:
1. `kswapd` can invoke shrinker callbacks while `nv_mem_pool_destroy()` is freeing pool resources
2. The kernel's drm_exec/shrinker infrastructure can access `dma_resv` while fence contexts are being destroyed

Both issues stem from the same root cause: cleanup callbacks not being stopped before the resources they access are released.
Issue 1: Memory Pool Shrinker Race
Problem
The shrinker is unregistered after freeing the pool's page lists:
Race Scenario
Fix
- Move `nv_mem_pool_shrinker_free()` to the start of destruction
- Add `synchronize_rcu()` after unregistration to ensure no callbacks are in-flight (the kernel iterates shrinkers under RCU)

Issue 2: DRM Fence Context Destruction Race
Problem
When a GEM object with an associated fence context is destroyed, the current code:
1. Calls `drm_gem_object_release()` (releases `dma_resv`)
2. Tears down the fence context afterwards

This allows the kernel's drm_exec/shrinker infrastructure to access `dma_resv` while fences are still active.

Race Scenario
Fix
Introduce two-phase destruction for fence contexts:
- `prepare_release`/`prepare_destroy`: stop callbacks and timers, and signal all pending fences before `drm_gem_object_release()`
- `free`/`destroy`: release NVKMS resources and free memory after the GEM object is fully released

This ensures fences are detached from `dma_resv` before it is released and can no longer be safely accessed.

Changes
nv-vm.c
- Reorder `nv_mem_pool_destroy()` to free the shrinker first
- Add `synchronize_rcu()` after `shrinker_free()`/`unregister_shrinker()`

nvidia-drm-gem.h/c
- Add a `prepare_release` callback to `nv_drm_gem_object_funcs`
- Call `prepare_release` before `drm_gem_object_release()` in `nv_drm_gem_free()`

nvidia-drm-fence.c
- Add a `prepare_destroy` callback to `nv_drm_fence_context_ops`
- Split `__nv_drm_prime_fence_context_destroy()` into prepare/destroy phases
- Split `__nv_drm_semsurf_fence_ctx_destroy()` into prepare/destroy phases
- Add `__nv_drm_fence_context_gem_prepare_release()` to call the prepare phase

Testing
- Exercised the `kswapd` path through the nvidia shrinker/fence callbacks
- Tested on RTX 5090 with kernel 6.18.5; system stable after the fix
These bugs affect all users of nvidia-open kernel modules under memory pressure. Symptoms include:
- Kernel crashes in `shrink_slab()` or `drm_exec` paths

The fixes follow established kernel conventions: unregister/stop callbacks before freeing the resources they access.
References
- `include/linux/shrinker.h`
- `drivers/gpu/drm/drm_gem.c`
- `kernel-open/nvidia/nv-vm.c`
- `kernel-open/nvidia-drm/nvidia-drm-gem.c`
- `kernel-open/nvidia-drm/nvidia-drm-gem.h`
- `kernel-open/nvidia-drm/nvidia-drm-fence.c`